Accumulating Transformations for Hierarchical Linear
Regression HMM Adaptation



Field of Invention 

This invention relates to speech recognition and more particularly to adaptive speech
recognition with hierarchical linear regression HMM adaptation.

Background of Invention 

Hierarchical Linear Regression (HLR), e.g. MLLR [see C. J. Leggetter and P. C.
Woodland, Maximum likelihood linear regression for speaker adaptation of continuous density
HMMs, Computer Speech and Language, 9(2):171-185, 1995], is now a common technique for
transforming Hidden Markov Models (HMMs) for use in an acoustic environment different
from the one in which the models were initially trained. The environment differences refer to
speaker accent, speaker vocal tract, background noise, recording device, transmission channel,
etc. HLR improves word error rate (WER) substantially by reducing the mismatch between
training and testing environments [see C. J. Leggetter and P. C. Woodland, cited above].

Hierarchical Linear Regression (HLR), e.g. MLLR (Maximum Likelihood Linear Regression),
is an iterative process that transforms an initial set of HMMs step by step into a target
model. Typically, the iteration requires M alignments of the transformed HMMs against speech
data, and each alignment result is used to produce new transformations through N EM
re-estimations. Thus M×N steps are required.

Current methods build the models at the m-th step from the models at the (m - 1)-th step.
Each EM (Expectation Maximization) step produces a set of transformations that is used by the
step next to it, as illustrated in Figure 1. To reproduce the target models, the M×N
transformations have to be stored and later applied to the initial models. At recognition time,
two alternatives can be considered for obtaining the target models. The first is to store the
model set obtained after the M×N transformations. As typical continuous speech recognizers may
use tens of thousands of mean vectors, storing additional parameters of that size is unaffordable
in situations such as speech recognition on mobile devices. The second is to apply the M×N
transformations successively to the initial model set, as illustrated in Figure 2. This requires
storing the M×N transformations. Typically the storage requirement is substantially lower.
However, it is still prohibitive for typical embedded systems such as DSP-based ones. Notice
that, as represented by the size of the boxes in Figure 2, the number of transformations in each
transformation step may be different.

Summary of Invention 

A new method, which builds the models at the m-th step directly from the models at the
initial step, is provided to minimize storage and calculation. The method merges the M×N
transformations into a single transformation. The merge guarantees the exactness of the
transformations and makes it possible for recognizers on mobile devices to have adaptation
capability. The goal of the method to be described is to provide a single set of transformations
that combines all M×N sets of transformations, so that a target model at any iteration can be
calculated directly from the initial model and the single set of transformations. Figure 3
illustrates the goal.

The combination guarantees the exactness of the total transformations, i.e. the resulting
models obtained by the single set of transformations are the same as the target models obtained
by successive applications of transformations. This result makes it possible for recognizers on
mobile devices to have adaptation capability.

Description of Drawings: 

Figure 1 illustrates types of iterations.

Figure 2 illustrates target models obtained by successive application of several sets of
transformations T1, T2.

Figure 3 illustrates target models obtained by a single application of one set of
transformations T.

Figure 4 illustrates part of a regression tree.

Figure 5 illustrates the operation according to one embodiment of the present invention.

Figure 6 illustrates the system according to one embodiment of the present invention.

Description of Preferred Embodiment

In accordance with the present invention, the method builds the models at the m-th step
directly from the models at the initial step by accumulating the successive transformations, as
illustrated in Figures 3 and 5. The algorithms for providing this are derived in the
following.

Let $\Xi = \{\xi_1, \xi_2, \ldots, \xi_N\}$ be the set of nodes of the regression tree. Leaf
nodes $\Omega \subset \Xi$ of the tree each correspond to a class. A class can be an HMM, a
cluster of distributions, a state PDF, etc., depending on the adaptation scheme. A leaf node
$\alpha \in \Omega$ is assigned the number $m(\alpha, i)$ of acoustic vectors associated with the
node at iteration $i$.

For illustration, Figure 4 shows part of a tree with leaves corresponding to phone HMMs.
We introduce the function

$$\phi : \Xi \to \Xi$$

such that $\xi_j = \phi(\xi_i)$ if and only if $\xi_j$ is the root (i.e. the parent) of the node
$\xi_i$. Similarly, we introduce the function

$$\varphi : \Xi \times [0, 1] \to \Xi$$

such that $\xi = \phi(\varphi(\xi, i))$, i.e. $\varphi(\xi, i)$ is the i-th descendant of the
node $\xi$.

At each iteration of parameter estimation, each node is associated with a number
$\rho(\xi, i)$ recording the total number of input vectors under the node:

$$\rho(\xi, i) = \begin{cases} m(\xi, i) & \text{if } \xi \in \Omega \\ \sum_{j} \rho(\varphi(\xi, j), i) & \text{otherwise} \end{cases} \qquad (1)$$

A node is called reliable if

$$\rho(\xi, i) > P \qquad (2)$$

where P is a constant, fixed for each alignment. We introduce the function

$$\psi : \Xi \times \mathbb{N} \to \{\text{False}, \text{True}\}$$

such that $\psi(\xi, i)$ indicates whether a node is reliable at the i-th iteration. Note that at
each iteration, as the alignment between leaf nodes and speech signals may change, $\psi$ is a
function of i. Only reliable nodes are assigned a transformation $T_\xi^i$. Each leaf node, e.g.
each HMM, has its transformation located on the first reliable node found by recursively tracing
back toward the root.
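
The vector count and the reliability test can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the tree shape, the node names, the per-leaf counts, and the helper names `rho` and `reliable_nodes` are all hypothetical and do not come from the described embodiment.

```python
# Hypothetical regression tree: node -> list of children (empty list = leaf).
CHILDREN = {
    "root": ["vowels", "consonants"],
    "vowels": ["aa", "iy"],
    "consonants": ["s", "t"],
    "aa": [], "iy": [], "s": [], "t": [],
}

# m(leaf, i): made-up numbers of acoustic vectors aligned to each leaf
# at one iteration.
LEAF_COUNTS = {"aa": 12, "iy": 3, "s": 7, "t": 0}


def rho(node, children, leaf_counts):
    """Total number of input vectors under `node`."""
    kids = children[node]
    if not kids:
        # Leaf node: the count comes directly from the alignment.
        return leaf_counts.get(node, 0)
    # Internal node: sum the counts of all descendants.
    return sum(rho(kid, children, leaf_counts) for kid in kids)


def reliable_nodes(children, leaf_counts, threshold):
    """Set of nodes whose vector count exceeds the threshold P."""
    return {
        node for node in children
        if rho(node, children, leaf_counts) > threshold
    }


counts = {node: rho(node, CHILDREN, LEAF_COUNTS) for node in CHILDREN}
print(counts)
print(sorted(reliable_nodes(CHILDREN, LEAF_COUNTS, threshold=10)))
print(sorted(reliable_nodes(CHILDREN, LEAF_COUNTS, threshold=5)))
```

With these made-up counts, lowering the threshold from 10 to 5 grows the reliable set from three nodes to five, which is the effect of decreasing P across alignments.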

Another function we introduce is

$$\chi : \Xi \times \mathbb{N} \to \Xi$$

such that $\xi' = \chi(\xi, i)$ is the first root node of $\xi$ that satisfies
$\psi(\xi', i) = \text{True}$.
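
The trace-back toward the root can be sketched as follows. The parent table, the node names, the two reliable sets, and the helper names `chi` and `transformation_node` are hypothetical and chosen only for illustration; the sketch also assumes that a node with no reliable ancestor simply yields `None`.

```python
# Hypothetical parent table for a small regression tree (root has no parent).
PARENT = {
    "root": None,
    "vowels": "root", "consonants": "root",
    "aa": "vowels", "iy": "vowels",
    "s": "consonants", "t": "consonants",
}

# Made-up reliable sets for two values of the threshold P.
RELIABLE_HIGH_P = {"root", "vowels", "aa"}
RELIABLE_LOW_P = {"root", "vowels", "consonants", "aa", "s"}


def chi(node, parent, reliable):
    """First reliable ancestor of `node`, or None if there is none."""
    current = parent[node]
    while current is not None and current not in reliable:
        current = parent[current]
    return current


def transformation_node(node, parent, reliable):
    """Node that holds the transformation used by `node`."""
    # A reliable node carries its own transformation; otherwise the
    # transformation is found by tracing back toward the root.
    if node in reliable:
        return node
    return chi(node, parent, reliable)


for leaf in ("aa", "iy", "s", "t"):
    print(leaf,
          transformation_node(leaf, PARENT, RELIABLE_HIGH_P),
          transformation_node(leaf, PARENT, RELIABLE_LOW_P))
```

In this example the leaf "t" borrows the root transformation under the higher threshold and the "consonants" transformation under the lower one, so the node-to-transformation association depends on the iteration.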

We use the general form for linear regression transformation, which applies a linear
transformation T to the mean vector of Gaussian distributions:

$$\hat{\mu} = T(\mu) = A\mu + B$$

where A is a D×D matrix, $\mu$ a D-dimensional column vector, and B a D-dimensional column
vector. We assume that at any step the current model is always obtained by transforming the
initial model, i.e. we always map the original models:

$$\forall n\, \forall \xi : \quad \hat{\mu}_n = \hat{T}_{n,\xi}(\mu_0) = \hat{A}_{n,\xi}\,\mu_0 + \hat{B}_{n,\xi} \qquad (3)$$

Referring to Figure 1, we distinguish two types of parameter estimation iterations:
between EM re-estimations and between alignment iterations. Correspondingly, in the next two
sections we will study two types of transformation combinations:

• Transformation accumulation between EM estimations. 

• Transformation accumulation between alignment iterations.

Transformation accumulation between EM estimations



Given

• The set of transformations that maps the initial models through n - 1 EM re-estimations
(global at n - 1).

• The set of transformations that maps the models at the (n - 1)-th iteration to the models at
iteration n (local at n).

We want to find the set of accumulated transformations, global at n, which combines the
global at n - 1 and local at n transformations.

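As the alignment is fixed between two EM iterations, the reliable node information is
unchanged. Therefore the association between nodes and transformations is the same at the two
EM iterations.

At any given alignment, for each $\xi \in \Xi$ such that $\psi(\xi, i)$ holds, let
$\hat{A}_{n-1,\xi}$ and $\hat{B}_{n-1,\xi}$ be the global transformation derived at EM iteration
n - 1, and $A_{n,\xi}$ and $B_{n,\xi}$ be the local transformation derived at EM iteration n.
Then a single transformation $\hat{A}_{n,\xi}$ and $\hat{B}_{n,\xi}$ is combined from the two
transformations.

Proposition 1. $\forall \xi \in \Xi \wedge \psi(\xi, i)$, where the hat denotes a global
transformation,

$$\hat{\mu}_n = \hat{A}_{n,\xi}\,\mu_0 + \hat{B}_{n,\xi} \qquad (4)$$

with

$$\hat{A}_{n,\xi} = A_{n,\xi}\,\hat{A}_{n-1,\xi} \qquad (5)$$

$$\hat{B}_{n,\xi} = A_{n,\xi}\,\hat{B}_{n-1,\xi} + B_{n,\xi} \qquad (6)$$

PROOF:

The case n = 1 corresponds to a single transformation and the correctness of Eq-5 is obvious.
For n > 1, using Eq-3:

$$\hat{\mu}_n = A_{n,\xi}\left(\hat{A}_{n-1,\xi}\,\mu_0 + \hat{B}_{n-1,\xi}\right) + B_{n,\xi} \qquad (7)$$

$$= \left(A_{n,\xi}\,\hat{A}_{n-1,\xi}\right)\mu_0 + \left(A_{n,\xi}\,\hat{B}_{n-1,\xi} + B_{n,\xi}\right) \qquad (8)$$

$$= \hat{A}_{n,\xi}\,\mu_0 + \hat{B}_{n,\xi} \qquad (9)$$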
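
The accumulation rule can be checked numerically. The following is a minimal sketch, assuming D = 2 and three made-up local transformations; the helper names (`mat_mul`, `mat_vec`, `vec_add`, `apply_transform`, `accumulate`) are hypothetical and plain Python lists stand in for matrices and vectors.

```python
def mat_mul(a, b):
    """Product of two matrices given as lists of rows."""
    return [
        [sum(a[r][k] * b[k][c] for k in range(len(b))) for c in range(len(b[0]))]
        for r in range(len(a))
    ]


def mat_vec(a, v):
    """Product of a matrix and a column vector."""
    return [sum(row[k] * v[k] for k in range(len(v))) for row in a]


def vec_add(u, v):
    return [x + y for x, y in zip(u, v)]


def apply_transform(transform, mu):
    """mu_hat = A mu + B."""
    a, b = transform
    return vec_add(mat_vec(a, mu), b)


def accumulate(global_prev, local):
    """Merge 'global at n-1' with 'local at n' into 'global at n'."""
    a_prev, b_prev = global_prev
    a_loc, b_loc = local
    a_new = mat_mul(a_loc, a_prev)                    # A_n * A_hat_{n-1}
    b_new = vec_add(mat_vec(a_loc, b_prev), b_loc)    # A_n * B_hat_{n-1} + B_n
    return a_new, b_new


# Made-up local transformations (A, B) for three EM re-estimations.
LOCALS = [
    ([[2, 0], [0, 1]], [1, -1]),
    ([[1, 1], [0, 1]], [0, 2]),
    ([[0, 1], [1, 0]], [3, 0]),
]
MU_0 = [1, 2]  # made-up initial mean vector

# Successive application: each step transforms the previous result.
mu_successive = MU_0
for local in LOCALS:
    mu_successive = apply_transform(local, mu_successive)

# Accumulation: keep one global transformation, apply it once to MU_0.
global_t = LOCALS[0]
for local in LOCALS[1:]:
    global_t = accumulate(global_t, local)
mu_direct = apply_transform(global_t, MU_0)

print(mu_successive, mu_direct)
```

Only one (A, B) pair per reliable node needs to be kept at any time, whatever the number of EM re-estimations. Integer values are used here so that the two results can be compared exactly.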

Transformation accumulation between alignment iterations

Given

• The set of transformations that maps the initial models through i - 1 alignments (global at
i - 1).

• The set of transformations that maps the models at the (i - 1)-th alignment to the models at
alignment i (local at i).

We want to find the set of accumulated transformations, global at i, which combines the
global at i - 1 and local at i transformations.



Unlike the accumulation between two EM iterations, the alignment here may change, which
results in a change in the reliable node information. Therefore the association between nodes
and transformations cannot be assumed fixed from the (i - 1)-th to the i-th alignment. For
instance, the number of transformations at i may differ from that at i - 1, for two reasons:

• The value of P in Eq-2 may be different. Typically, P is decreased to increase the
number of transformations as i increases.

• Even if P is kept constant across alignments, as the acoustic model parameters are
changed at each alignment, $\rho(\xi, i)$ may change as a function of i, and so will
$\psi(\xi, i)$.



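The combined set of transformations is specified by Eq-10.

Proposition 2. $\forall \xi \in \Xi$:

$$\hat{T}_\xi^{i} = \begin{cases}
T_\xi^{i} \circ \hat{T}_\xi^{i-1} & \text{if } \psi(\xi, i-1) \wedge \psi(\xi, i) \\
T_{\chi(\xi,i)}^{i} \circ \hat{T}_\xi^{i-1} & \text{if } \psi(\xi, i-1) \wedge \neg\psi(\xi, i) \\
T_\xi^{i} \circ \hat{T}_{\chi(\xi,i-1)}^{i-1} & \text{if } \neg\psi(\xi, i-1) \wedge \psi(\xi, i) \\
\text{None} & \text{otherwise}
\end{cases} \qquad (10)$$

where $T' \circ T$ denotes applying T first and then T', the two being merged into a single
transformation as in Proposition 1.

PROOF:

$\forall \xi \in \Xi$, only one of four situations can happen:

1. It is a reliable node at both iterations i - 1 and i. The parameters of the models under this
node are therefore transformed by $\hat{T}_\xi^{i-1}$ and then by $T_\xi^{i}$.

2. It is a reliable node at iteration i - 1 but not at iteration i. The transformation at i is
therefore the one at the node $\chi(\xi, i)$. The parameters of the models under node $\xi$ are
therefore transformed by $\hat{T}_\xi^{i-1}$ and then by $T_{\chi(\xi,i)}^{i}$.

3. It is a reliable node at iteration i but not at iteration i - 1. The transformation at i - 1
is therefore the one at the node $\chi(\xi, i-1)$. The parameters of the models under node $\xi$
are therefore transformed by $\hat{T}_{\chi(\xi,i-1)}^{i-1}$ and then by $T_\xi^{i}$.

4. It is not a reliable node at either iteration. The node therefore has no transformation.

In the fourth case, no transformation will be generated.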
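
The four cases can be sketched as follows. To keep the example short it assumes D = 1, so a transformation is a pair of scalars (a, b) acting as a·μ + b; it also assumes the root node is reliable at both alignments, so that a reliable ancestor always exists. The tree, the values, and the names `compose`, `first_reliable_ancestor` and `combine_alignments` are hypothetical.

```python
def compose(first, second):
    """Single transformation equal to applying `first`, then `second`."""
    a1, b1 = first
    a2, b2 = second
    return (a2 * a1, a2 * b1 + b2)


def first_reliable_ancestor(node, parent, transforms):
    """Closest ancestor of `node` that owns a transformation."""
    current = parent[node]
    while current is not None and current not in transforms:
        current = parent[current]
    return current


def combine_alignments(parent, global_prev, local_cur):
    """Merge 'global at i-1' with 'local at i' into 'global at i'.

    Both arguments map each reliable node to its (a, b) pair; a node
    absent from a mapping is not reliable at that alignment.
    """
    combined = {}
    for node in parent:
        was_reliable = node in global_prev
        is_reliable = node in local_cur
        if not was_reliable and not is_reliable:
            continue  # case 4: no transformation is generated
        # Cases 1 and 2 use the node's own global transformation at i-1;
        # case 3 borrows the one of its first reliable ancestor at i-1.
        prev_node = node if was_reliable else first_reliable_ancestor(
            node, parent, global_prev)
        # Cases 1 and 3 use the node's own local transformation at i;
        # case 2 borrows the one of its first reliable ancestor at i.
        cur_node = node if is_reliable else first_reliable_ancestor(
            node, parent, local_cur)
        combined[node] = compose(global_prev[prev_node], local_cur[cur_node])
    return combined


# Hypothetical tree and made-up scalar transformations.
PARENT = {"root": None, "vowels": "root", "aa": "vowels", "iy": "vowels"}
GLOBAL_PREV = {"root": (2, 1), "vowels": (3, 0)}   # reliable at i-1
LOCAL_CUR = {"root": (1, 5), "aa": (2, -1)}        # reliable at i

combined = combine_alignments(PARENT, GLOBAL_PREV, LOCAL_CUR)
print(combined)
```

Here "root" falls under case 1, "vowels" under case 2, "aa" under case 3, and "iy" under case 4. A mean under "iy" still reaches the same target as with successive application, because it traces back to the combined transformation stored at "vowels".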

Referring to Figure 6, there is illustrated a system according to one embodiment of the
present invention, wherein the input speech is compared to models 61 at recognizer 60. The
models 61 are HMM models that have had HLR HMM adaptation or training using only a single
set of transformation parameters, wherein transformation accumulation between EM estimations
is according to equation 4 and transformation accumulation between alignment iterations is
according to equation 10.